Facial Expressions Tracking and Recognition: Database Protocols for Systems Validation and Evaluation


Abstract 

Each human face is unique. It has its own shape, topology, and distinguishing features. As such, developing and testing 
facial tracking systems are challenging tasks. The existing face recognition and tracking algorithms in Computer Vision mainly 
specify concrete situations according to particular goals and applications, requiring validation methodologies with data that fits 
their purposes. However, a database that covers all possible variations of external and facial factors does not exist, increasing researchers’
work in acquiring their own data or compiling groups of databases.

To address this shortcoming, we propose a methodology for facial data acquisition through definition of fundamental variables, 
such as subject characteristics, acquisition hardware, and performance parameters. Following this methodology, we also propose 
two protocols that allow the capturing of facial behaviors under uncontrolled and real-life situations. As validation, we executed both
protocols, which led to the creation of two sample databases: FdMiee (Facial database with Multi input, expressions, and environments)
and FACIA (Facial Multimodal database driven by emotional induced acting).

Using different types of hardware, FdMiee captures facial information under variations in environment and facial behavior.
FACIA is an extension of FdMiee that introduces a pipeline to acquire additional facial behaviors and speech using an emotion-acting
method. Therefore, this work eases the creation of adaptable databases according to an algorithm’s requirements and applications,
leading to simplified validation and testing processes.
1. Introduction

In the field of Computer Vision (CV), there are several existing databases that contain a wide range of facial expressions and behaviors, developed based on specific scenarios. The data contained in these databases is usually used for validation and performance tests, as well as for training facial models in CV algorithms. To date, however, computational works include only a limited number of features, representing typical facial extraction elements. In fact, there is no single database that integrates a full set of situations: some are dedicated only to expressions, others to lighting conditions, some are just for extracting facial patterns used to define training models, others for emotion classification, etc. This means the information is split across a variety of databases, making it impossible to validate a facial tracking system under numerous specific situations (e.g. partial face occlusions from hardware or glasses, changes in background, variations in illumination, head pose variations, etc.) or to train emotion classifier systems capable of capturing the subtleties of the face using only one database. This drawback usually leads to systems over-fitting to data, presenting a high specificity to a certain environment or limiting the facial features recognized. Therefore, every time validation and performance tests or training sets must be designed, researchers struggle to find databases that fit all of the system’s requirements. As an example, deploying a recent face tracking system required the compilation of three different databases. Alternatively, researchers define and set up their own procedures to acquire their own databases, recruiting subjects, defining protocols, and preparing capture equipment, which are all time-consuming processes. This “database customization” requirement exists because databases require specific features or formats (e.g. high-resolution videos and infra-red pictures) according to the CV system’s profile and goal. These features and formats contain a wide range of variations in external and facial behavior parameters to simulate real-life situations and provide information that accurately reproduces the scenario where the system is going to be applied.

In this work, we designed two generic protocols and developed a methodology for data acquisition for face recognition systems, as well as for tracking and training of CV algorithms. Our methodology defines each acquisition protocol as composed of three basic variables: i) subject characteristics, ii) acquisition hardware, and iii) performance parameters. These variables are classified as flexible (i.e. they can be altered according to system requirements without influencing the protocol guidelines) or fixed (i.e. defined and constrained by the protocol guidelines). The flexible variables are connected to system requirements, and the fixed ones to the information recorded and simulated. As performance variables, we define the following parameters: external (e.g. environment changes in lighting and background) and facial (e.g. variations in facial expressions and their intensity). To test the accuracy and performance of algorithms in facial feature tracking or to train face models, the databases used need to contain a broad set of external and facial behavior variations. Setting up these variables through our proposed methodology and adopting our protocols eases the acquisition of databases with facial information under real-life scenarios and realistic facial behaviors. To validate this process, we followed both protocols and acquired two sample databases. We also analysed the obtained results to establish proof-of-concept.

Figure 1: Summary of database protocols’ contributions (B) in the facial database universe (A).

We dubbed the first protocol, which generated FdMiee (“Facial database with multi input, expressions and environments”), Protocol I. Protocol I aims to guide researchers through acquiring data using three types of capture hardware while varying the performance variable, giving special focus to the variation of external parameters. As an extension of Protocol I, Protocol II introduces variations in the performance variable regarding facial behaviors. Validation of this protocol generated FACIA (“Facial Multimodal database driven by emotional induced acting”).

Figure 1 represents our overall contribution schematically, regarding the types of data captured in the protocols. It represents the facial databases’ universe through environment situations and conditions, where we include the group of available facial behaviors, with a small part reserved to introduce behaviors (Figure 1-A). Taking this scheme into account, we can mirror the domain of our database protocol and represent the contributions of FdMiee and FACIA diagrammatically (Figure 1-B).

2. Background 

To develop guidelines for database acquisition, we searched the literature for methodologies and variance parameters required to test and evaluate CV systems. We analyzed state-of-the-art databases and classified them into two groups according to their output format: video and image-based. The most commonly-used video databases are as follows:

• BU-4DFE (3D capture + temporal information): a 3D dynamic facial expression database;

• BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database;

• MMI Facial Expression Database;

• VidTIMIT Audio-Video Database;

• Face Video Database of the Max Planck Institute.

Comprehensive and well-documented video databases exist. However, to access them, a very strict license must be procured and a payment provided. BU-4DFE presents a high-resolution 3D dynamic facial expression database. Facial expressions are captured at 25 frames per second while subjects perform Ekman’s six basic emotions. Each expression sequence contains about 100 frames, across 101 subjects. More recently, this database was extended to create a 3D spontaneous facial expression database. Another commonly used facial expression database is the MMI database. It is an ongoing project that holds over 2000 videos and more than 500 images from 50 subjects. Information on the displayed AUs (Action Units) is also given with the samples. The VidTIMIT Audio-Video database contains video and audio recordings of 43 people reciting 10 short sentences each. Each person also performs a head rotation sequence per session, which can allow pose independence in facial recognition. Finally, the Face Video Database from the Max Planck Institute provides videos of facial action units, used for face and object recognition, though no further information is given. Using videos instead of images in model training allows better detection of spontaneous and subtle facial movements. However, the available databases are limited to the detection of standard facial expressions or do not explore situations with different lighting levels.

Regarding image-based databases, we came across a comparison study whose Table VIII describes the commonly-used image-based databases for validation of face tracking systems and exposes their limitations. As examples of current image-based databases, we analyzed the following:

• Yale;

• Yale B;

• FERET;

• CMU Pose, Illumination and Expression (PIE);

• Oulu Physics.

The Yale and Yale B databases contain a limited number of grayscale images with well-documented variations in lighting, facial expression, and pose. In contrast, the FERET database has a high number of subjects with complete pose variation. However, no information about lighting is given. Another interesting database is CMU PIE, which also tests extreme lighting variations for 68 subjects. These three databases are frequently used for facial recognition, not only for model training but also for validation. Finally, we also highlight the Oulu Physics database, since it presents variation in lighting color (horizon, incandescent, fluorescent, and daylight) on 125 faces.

Based on this research, we concluded that there is a wide range of databases that explore and simulate diverse facial expressions under different environment conditions. However, the available information is spread throughout many databases. In other words, a single database that combines all these facial and environment behaviors and variations, providing a complete tool for validation of facial expression tracking and classification, still does not exist.

A complete state-of-the-art review of the emotional databases currently available can be found in [19]. We searched for a facial expressions database that would simultaneously provide color and depth video (3D data stream) as well as speech information, along with emotional data. However, no database fulfilled these search criteria.

The increase in affect recognition CV methods [20] has led to a need to generate databases containing spontaneous expressions. To establish how to induce these expressions in participants, we analyzed the review paper on Mood Induction Procedures (MIPs) and investigated which resource materials could be used to enhance and introduce realism in expressed emotions [22]. We concluded that the most commonly-used emotion induction procedure is the Velten method, characterized by a self-referent statement technique. However, the most powerful techniques are combinations of different MIPs, such as Imagination, Movies/Films instructions, or Music. Therefore, the technique chosen for our experiment was a combination of the Velten technique with imagination, where we proposed an emotional sentence enacting, similar to the one presented by Martin et al. [22].

Some available databases that use similar MIPs induce emotions in the users by asking them to imagine themselves in certain pre-defined situations [23][24]. However, the usage of this procedure without complementary material (e.g. sentences) does not guarantee facial expressivity from the user [23]. Since we intended to record speech, we analysed state-of-the-art multimodal databases and found that none contained Portuguese speech. Therefore, we decided to explore this potential research avenue.

3. Protocol Methodology 

Analysing the background and details of facial data acquisition setups, we propose that, to create a protocol, three fundamental variables need to be characterized: subject characteristics, acquisition hardware, and performance parameters (Table 1).

These variables are classified as being either flexible or fixed, according to their impact on the protocol guidelines. Subject characteristics and acquisition hardware are flexible variables, as they can be changed according to system requirements. For example, one can use male subjects captured with a high-speed camera or any other kind of hardware available, since these choices do not influence the acquisition guidelines themselves but only affect the acquisition setup. In contrast, fixed variables, such as performance parameters, influence the definition of the guidelines, i.e. different performance parameters require us to take different steps for their simulation and acquisition.

Subject characteristics include gender, age, race, and other 
features that can be extrapolated from the subjects’ samples. 


Table 1: Protocol flexible and fixed variables.

| Type | Protocol variable | Values |
|---|---|---|
| Flexible | Subject Characteristics | Gender; Age; Race; (...) |
| Flexible | Acquisition Hardware | Webcam; HD Camera; Infra-Red Camera; Microsoft Kinect; High-Speed Camera; (...) |
| Fixed | Performance Parameters: External Parameters | Background; Lighting; Multi-Subject; Occlusions |
| Fixed | Performance Parameters: Facial Parameters | Head Rotation; Expressions (Macro, Micro, False, Masked, Subtle); Speech |


The subject characteristics variable introduces specific facial behaviors (e.g. cultural variations in emotion expressions) into the database. Regarding acquisition hardware, we enable the usage of any type of input hardware according to the acquisition specifications. Different combinations of these flexible variables can be applied to any of the fixed performance parameter guidelines. Performance variables describe the procedures for acquiring the data required for performance tests of CV algorithms. They are split into External and Facial categories, according to what we want to test. External parameters are related to changes in the environment, such as background, lighting, number of persons in a scene (i.e. multi-subject), and occlusions [27]. The possible variations are almost infinite [28] due to their uncontrolled nature in real-life environments. Facial behaviours should contain facial expression data triggered by emotions, such as macro, micro, subtle, false, and masked expressions [29], or even speech information. Ekman et al. [29] define six universal emotions: anger, fear, sadness, disgust, surprise, and happiness. These universal emotions are expressed in different ways according to a person’s mood and intentions. The way they are expressed leads us to the following classification of expressions:

• Macro: These expressions last between half a second and 4 seconds. They often repeat and fit what is being said as well as the speech. Facial expressions of high intensity are usually connected to the six universal emotions [29][30];

• Micro: Brief facial expressions (e.g. lasting milliseconds) related to emotion suppression or repression [29][30];

• False: Mirrors an emotion that is deliberately performed and is not being felt [29][30];

• Masked: False expression created to mask a felt macro-expression [29][30];

• Subtle: Expressions of low intensity that occur when a person starts to feel an emotion or shows an emotional response to a certain situation, another person, or the surrounding environment.

Facial behaviors generated by speech usually contain a combination of the above expressions.
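
As a rough illustration of the classification above, the following Python sketch maps an expression duration to a coarse class hint. Only the macro range (half a second to 4 seconds) comes from the text; treating every shorter expression as micro is an assumption of this sketch, and the function name `duration_hint` is hypothetical. Duration alone cannot separate false, masked, or subtle expressions, which depend on intent and intensity.

```python
def duration_hint(duration_s: float) -> str:
    """Map an expression duration in seconds to a coarse class hint."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s < 0.5:
        # Assumption: anything shorter than the macro range counts as micro.
        return "micro"
    if duration_s <= 4.0:
        # Range stated in the text for macro expressions.
        return "macro"
    # Longer than the macro range described in the text.
    return "undetermined"


if __name__ == "__main__":
    for seconds in (0.04, 0.5, 2.0, 6.0):
        print(seconds, duration_hint(seconds))
```

A hint of this kind could only serve as a first filter when annotating recordings; the final label would still need human or model-based judgment.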

Following this methodology, we developed two protocols. We dubbed the first protocol, which generates FdMiee, Protocol I. To validate this protocol, we acquired data from eight subjects with different characteristics. We used low-resolution, high-resolution, and infra-red cameras as acquisition hardware variables. As performance parameter variables, we simulated multi-input expressions and environments to test the invariance and accuracy of facial tracking systems exposed to changes, e.g. different lighting conditions and universal-based and speech facial expressions. To validate the results, we executed 360 acquisitions and demonstrated the protocol’s potential to acquire data containing uncontrolled scenarios and facial behaviors. We dubbed the second protocol, which creates the FACIA database, Protocol II. It extends Protocol I’s performance parameter variables by introducing induced facial behaviors. To validate the results, we studied the protocol’s effectiveness for acquiring multimodal databases of induced facial expressions with speech, color, and depth video (3D data stream) data. To achieve this validation goal, we presented a novel induction method that uses emotional acting to generate facial behaviors inherent to expressions. We also provided emotional speech in the Portuguese language, since there is currently no 3D facial database that uses this language. Similarly to FdMiee, in FACIA we established proof-of-concept through an experiment with eighteen participants, in a total of 504 acquisitions.

As a typical usage example of the protocols, suppose a research team has 10 female subjects aged between 20 and 22 available. They would like to compile a database to test the head rotation tracking accuracy of a CV algorithm using an HD camera. Therefore, they define the female gender and the age range as subject characteristics. Then, they choose an HD camera as acquisition hardware and pick the facial parameter head rotation as the performance parameter. Finally, they follow our validated FdMiee protocol.

In summary, to follow the protocols, we first choose the parameters to simulate as fixed performance variables. This allows us to define the acquisition guidelines. Secondly, we determine the hardware variable and generate an acquisition setup. It is important to note that this variable is flexible, and thus changing it will not impact the guidelines. The same holds when different subject characteristics are used.
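
The relation between flexible and fixed variables can be illustrated with a minimal Python sketch. It is not part of the original protocols: the `AcquisitionPlan` class, its field names, and the `guidelines` method are hypothetical, and the parameter names are simply those listed in Table 1. The sketch only shows that the guidelines to follow depend on the fixed performance parameters, while hardware and subject characteristics can change freely.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Performance parameters listed in Table 1, grouped by category.
PERFORMANCE_PARAMETERS: Dict[str, Tuple[str, ...]] = {
    "external": ("background", "lighting", "multi-subject", "occlusions"),
    "facial": ("head rotation", "expressions", "speech"),
}


@dataclass(frozen=True)
class AcquisitionPlan:
    """Hypothetical container for the three protocol variables."""

    performance_parameters: Tuple[str, ...]  # fixed variable
    acquisition_hardware: Tuple[str, ...]  # flexible variable
    subject_characteristics: Tuple[Tuple[str, str], ...] = ()  # flexible variable

    def guidelines(self) -> List[str]:
        """Return the guideline names to follow; only the fixed variable matters."""
        known = {p for group in PERFORMANCE_PARAMETERS.values() for p in group}
        unknown = [p for p in self.performance_parameters if p not in known]
        if unknown:
            raise ValueError(f"unknown performance parameters: {unknown}")
        return sorted(self.performance_parameters)


# The usage example from the text: female subjects aged 20-22, HD camera, head rotation.
plan_a = AcquisitionPlan(
    performance_parameters=("head rotation",),
    acquisition_hardware=("HD camera",),
    subject_characteristics=(("gender", "female"), ("age", "20-22")),
)

# Same fixed variable, different flexible variables.
plan_b = AcquisitionPlan(
    performance_parameters=("head rotation",),
    acquisition_hardware=("infra-red camera",),
)

if __name__ == "__main__":
    print(plan_a.guidelines() == plan_b.guidelines())
```

Both plans resolve to the same guideline list, which mirrors the statement that changing a flexible variable does not impact the guidelines.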

4. Protocols and Validation 

In this section, we describe in detail the two protocols that follow our proposed methodology. Protocol I resulted in the FdMiee sample database, which contains facial data from uncontrolled scenarios. FdMiee focuses essentially on the performance variable guidelines for external parameters. The obtained data was recorded with three types of acquisition hardware. As its extension, Protocol II focuses on testing and simulation of the facial parameters of the performance variables, using Microsoft Kinect as hardware.

4.1. Protocol I 

Facial recognition and tracking systems are highly dependent on external conditions (i.e. environment changes). To reduce this dependency, we developed a protocol, based on our proposed methodology, for creating databases with changes in external parameters such as light, background, occlusions, and multi-subject. For facial parameters, we set up guidelines to capture variations in head rotation, as well as universal-based, contempt, and speech facial expressions. Table 1 summarizes the performance parameters acquired through this protocol.

4.1.1. Requirements

As protocol requirements, we set up the acquisition hardware and equipment to simulate the selected external and facial parameters.

Acquisition Hardware 

The chosen acquisition hardware simulates realistic scenarios captured using three types of hardware. To test the protocol guidelines, we chose the following equipment:

• Low-Resolution (LR) camera 

• High-Resolution (HR) camera 

• Infra-Red (IR) camera 

The first two cameras (LR and HR) allow us to study the influence of resolution on face tracking, face recognition, and expression recognition. The IR camera allows us to disregard lighting variations [34][35][36] and provides a different kind of information than the HR and LR cameras. The hardware used in this protocol should be aligned with one another to enable future comparison between data acquired with different hardware.

Environment-Change Generation Equipment 

To generate data with the defined parameters, we stabilize the following environment elements:

Background A solid-color, static background eases the process of detecting facial features and extracting information from the surrounding environment. The background should ideally be black (or very dark) to prevent interference with the IR camera (black has lower reflectance than lighter colors).

Lighting The room must be lit by homogeneous light that does not produce shadows or glare on the subject’s face. By taking these measures, we ensure that the skin color will have no variations throughout the acquisition process.

Figure 2: Acquisition setup proposed for Protocol I.

4.1.2. Acquisition Setup 

The subject sits in front of the acquisition hardware. This hardware setup is composed of three cameras (LR, HR, and IR). The subject’s backdrop should be black, with some space between the subject and the backdrop so that objects or other subjects can move behind the main scene. This setup is exemplified in Figure 2.

4.1.3. Protocol Guidelines 

To perform the acquisition, we suggest the presence of two team members: one to perform the acquisitions (A) and the other to perform the environment variations (B). The subject sits in front of the computer monitor and one of the team members aligns the subject with the cameras. During the entire acquisition procedure, the subject should remain as still as possible, to avoid introducing unintended changes between acquisitions.

Before starting the experiment, each subject has access to a printed copy of the protocol. This reduces the acquisition time, since the subject already knows what is going to take place during the experiment. Each performance parameter simulated and introduced in the scenario has its own guidelines:

Control Team member A takes a photo of the subject with a neutral face.

Lighting Team member A takes 3 photos with different exposures (High, Medium, Low). This variable was acquired only with the HR camera, because it is the only one where the exposure level can be changed.

Background The team prepares the background for the acquisition.

1. Team member A starts recording;

2. The subject stays still for 5 seconds while team member B performs movement if necessary (only in the case of a dynamic background);

3. Team member A stops recording.

Multi-Subject While the subject is being recorded, team member B appears in the scene for 10 seconds.

Occlusions For total occlusion, the subject starts in the center of the scene and slowly moves to a point outside the scene. For partial occlusions, a photograph is taken with a plain-colored surface, such as a piece of paper, covering each of the following parts of the face:

• Top;

• Left;

• Bottom;

• Right.

Head Rotation For each head pose (Yaw, Pitch, and Roll), the subject performs the movement in both directions while being recorded throughout the complete movement.

Universal-Based Facial Expressions, plus Contempt The subject repeats each of the following emotion expressions for 10 seconds, starting from the neutral pose and moving to a full pose:

• Joy;

• Anger;

• Surprise;

• Fear;

• Disgust;

• Sadness;

• Contempt.

Speech Facial Expressions The subject reads a cartoon or text and is encouraged to express their feelings about it.
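
The guidelines above can be turned into a per-subject checklist, sketched here in Python. The variant labels (`"Static"`, `"SecondPerson"`, `"Reading"`, and so on) and the names `VARIANTS`, `HARDWARE_FOR`, and `checklist` are hypothetical. The text states that lighting is acquired only with the HR camera; running every other parameter on all three cameras is an assumption of this sketch, since the protocol does not spell out which photo-based steps apply to the IR camera.

```python
from typing import Dict, List, Tuple

ALL_HARDWARE: Tuple[str, ...] = ("LR", "HR", "IR")

# Variants per performance parameter, following the guidelines in the text.
VARIANTS: Dict[str, Tuple[str, ...]] = {
    "Control": ("Neutral",),
    "Lighting": ("High", "Medium", "Low"),
    "Background": ("Static", "Dynamic"),
    "Multi-Subject": ("SecondPerson",),
    "Occlusions": ("Total", "Top", "Left", "Bottom", "Right"),
    "Head Rotation": ("Yaw", "Pitch", "Roll"),
    "Expressions": (
        "Joy", "Anger", "Surprise", "Fear", "Disgust", "Sadness", "Contempt",
    ),
    "Speech": ("Reading",),
}

# Lighting uses only the HR camera (stated in the text).
# Everything else defaults to all cameras (assumption of this sketch).
HARDWARE_FOR: Dict[str, Tuple[str, ...]] = {"Lighting": ("HR",)}


def checklist() -> List[Tuple[str, str, str]]:
    """List (parameter, variant, hardware) steps for one subject."""
    steps = []
    for parameter, variants in VARIANTS.items():
        for variant in variants:
            for hardware in HARDWARE_FOR.get(parameter, ALL_HARDWARE):
                steps.append((parameter, variant, hardware))
    return steps


if __name__ == "__main__":
    for step in checklist():
        print(step)
```

The resulting list is only a planning aid under the stated assumptions; it is not meant to reproduce the number of acquisitions reported for the validation.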

4.1.4. Obtained Outputs 

This protocol generates the following output data: 

• HR and LR photographs (.jpeg)

• LR camera videos - 15 fps (.wmv)

• HR camera videos - 25 fps (.mov)

• IR camera videos - 100 fps (.avi)

The emotions generated through variation of facial parameters are expected to contain a mixture of macro and micro expressions (i.e. subjects can be repressing and suppressing feelings), as well as false (i.e. the subject is making an effort to express certain emotions), subtle (i.e. when the subject cannot generate a high-intensity expression), and speech-based expressions.

Data Organization and Nomenclature 

For standardization purposes and further analysis, a folder was created for each acquisition hardware. Inside these folders there are sub-folders for each of the tested performance parameters. The output files were placed in the respective folder with the following naming convention:

CaptureModeVolunteer0X_SimulationName_Take0Y.format

where CaptureMode is the type of hardware, X is the number of the subject, SimulationName is the name of the performance parameter acquired and respective information, and Y is the take’s identification number.
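
A small Python sketch can build and parse file names that follow this convention. It assumes the template reads `CaptureModeVolunteer0X_SimulationName_Take0Y.format`, with underscores before the simulation name and before `Take`, and with subject and take numbers zero-padded to two digits; both points are assumptions, as are the helper names `build_name` and `parse_name` and the sample values used below.

```python
import re

# Assumed reading of the template:
# CaptureModeVolunteer0X_SimulationName_Take0Y.format
NAME_RE = re.compile(
    r"^(?P<capture_mode>[A-Za-z]+)Volunteer(?P<subject>\d+)"
    r"_(?P<simulation>.+)_Take(?P<take>\d+)\.(?P<extension>\w+)$"
)


def build_name(capture_mode: str, subject: int, simulation: str,
               take: int, extension: str) -> str:
    """Compose a file name; two-digit zero padding is an assumption."""
    return (
        f"{capture_mode}Volunteer{subject:02d}"
        f"_{simulation}_Take{take:02d}.{extension}"
    )


def parse_name(filename: str) -> dict:
    """Split a file name back into its fields, with numeric fields as int."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename}")
    fields = match.groupdict()
    fields["subject"] = int(fields["subject"])
    fields["take"] = int(fields["take"])
    return fields


if __name__ == "__main__":
    name = build_name("HR", 3, "LightingHigh", 1, "jpeg")
    print(name)
    print(parse_name(name))
```

Parsing the names back into fields makes it straightforward to group takes by hardware, subject, or performance parameter during later analysis.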
Figure 3: FdMiee sample results for HD Camera (A), Webcam (B), and IR Camera (C).

4.1.5. FdMiee Acquisition Protocol Validation

Following the described protocol guidelines, we acquired data from eight volunteers with the following subject characteristics:

Gender Male/Female;

Glasses With/Without;

Beard With different formats/Without;

Age 20-35 years.

In Figure 3 it is possible to see sample results for some of the performance parameters with the different acquisition hardware.

4.2. Protocol II 

The definition and extraction of induced facial behaviors and speech features inherent to spontaneous expressions is still a challenge for CV systems. To develop and subsequently evaluate a CV algorithm that achieves this goal, we proposed these two protocols to acquire a database containing, simultaneously, spontaneous facial expressions and speech information inherent to induced emotions, such as Ekman’s universal emotions [29][30]. Therefore, in this experiment we focus on the definition of guidelines to capture changes in the facial parameters of the performance variable.

4.2.1. Requirements 

We define two types of requirements: emotion induction method and equipment requirements. The emotion induction method is used as the basis for defining the protocol guidelines inherent to the simulation of facial parameters.

The Emotion Induction Method 

The majority of spontaneous facial expressions are generated in real-life situations. To simulate these facial behaviors, we proposed a protocol where the system asks for emotional acting in order to trigger facial responses from a subject. For this purpose, we combined Mood Induction Technique 1 (MIT 1), described by Hesse A.G. et al., with mood induction sentences suggested by Pitas I. et al. [22]. As an application example, we could have a system that asks the user to express certain emotions through facial or speech features. The user must pronounce certain sentences with a particular tone and facial expression, matching the required emotional state. According to the expression classification introduced in Section 3, using this method we are able to induce macro, micro, false, masked, and subtle expressions. Macro expressions are implicit, since we ask for the expression of the six Ekman universal emotions (i.e. anger, fear, sadness, disgust, surprise, and happiness). However, since we are in an induced-emotion context, subjects can have difficulty engaging in the proposed situation, thereby generating micro, false, and masked expressions. Subtle expressions are also triggered, because the subjects’ engagement with the induced sentence or context can be low. As expected, the produced facial expressions depend on the subjects’ interpretation and on how they immerse themselves in the simulated situation.

Our induction approach presents a novel view on emotion acting and its applications, a domain that remains unexplored in state-of-the-art databases.

We used ordinary people as subjects, instead of actors, to maintain the naturalness of real-life scenarios and to achieve a larger diversity of facial behaviors. Actors gain, over time, professional skills that the general population cannot reproduce; thus, they might introduce features that do not match real-world human performance. Some available databases that use MIT 1 try to induce emotions in the users by asking them to imagine themselves in certain general and predefined situations [23][24]. We avoided this approach, since suggesting certain situations does not guarantee that the subject will express a certain emotion as output, because different individuals react differently. Therefore, in our protocol, we asked the subjects to imagine and create for themselves a personal mental situation while they enact the pre-defined sentence. This aims to ensure engagement and a natural response from the subject. As mentioned before, the chosen emotions were the six basic Ekman emotions [29], due to their scientific acceptance and applicability in real-world situations. The sentences are pronounced in European Portuguese to match the subjects’ mother tongue. This is another contribution of our work, since there is currently no multimodal European Portuguese database available.

Equipment and Environment Requirements 

The acquisition setup uses the Microsoft Kinect as the acquisition hardware variable. Kinect records a 3D data stream as well as speech information. The illumination is not controlled, as acquisitions were executed during different periods of the day under uncontrolled lighting conditions. The background is static and white, and there is no sound isolation, so the speech signal can be affected by external noise. Sentences are displayed on a screen positioned in front of the subject. To allow further synchronization or re-synchronization, a sound and light emitter is used at the beginning of each recording (see the example in Figure 4). For this experiment in our protocol validation, we developed software that allows simultaneous recording of color and depth video with speech from Microsoft Kinect, in .bin, video, and audio formats. This software includes the Microsoft Facetracker SDK and also saves the information retrieved from this algorithm.

Figure 4: Example of our video-audio synchronizer (left).

Table 2: FACIA emotion induction method: sentences pronounced and acted by the subjects.

| Emotion | Sentences |
|---|---|
| Neutral | A jarra está cheia com sumo de laranja |
| Anger | O quê? Não, não, não! Ouve, eu preciso deste dinheiro! |
| Anger | Tu és pago para trabalhar, não para beberes café. |
| Disgust | Ah, uma barata! |
| Disgust | Ew, que nojo! |
| Fear | Oh meu deus, está alguém em minha casa! |
| Fear | Não tenho nada para si, por favor, não me magoe! |
| Joy | Que bom, estou rico! |
| Joy | Ganhei! Que bom, estou tão feliz! |
| Sadness | A minha vida nunca mais será a mesma. |
| Sadness | Ele(a) era a minha vida. |
| Surprise | E tu nunca me tinhas contado isso?! |
| Surprise | Eu não estava nada à espera! |

Figure 5: Acquisition setup proposed by the FACIA protocol.



4.2.2. Acquisition Setup

The subject sits in front of the capture hardware. The distance
between the subject and the Microsoft Kinect should be more than 1
meter to enable facial depth capture. A screen displays the sentence
that is about to be "acted". The subjects did not watch the
recordings or observe their own acting, which avoids self-evaluation
and any influence on their acting performance and expressivity. In
the FACIA protocol we propose the acquisition setup of Figure 5.

4.2.3. Protocol Guidelines

Each subject sits in front of the screen and acts out the two
sentences per emotion while their voice and facial expression are
recorded. We execute the procedure twice per sentence, which helps
ensure the integrity of the final results. We suggest a minimum of
two members (A and B) in the acquisition team. Before starting the
experiment, a protocol describing the experiment is given to the
subject.

The experiment starts with a neutral sentence [37] (which can be
used as a baseline for further experiments).

Therefore, for each sentence of Table 2 the following pipeline is
repeated two times:

1. Acquisition team member A says "1, 2, 3, I will record!".

2. Acquisition team member B uses the light/sound synchronizer.

3. Subject performs the emotion acting.

4. Acquisition team member A stops the recording.

4.2.4. Obtained Outputs

Using our acquisition protocol, we obtained the following data per
sentence enacted:

• Color video - 30 fps (.bin);

• Depth image - 30 fps (.bin);

• Audio in PCM format, 16000 Hz (.bin);

• Facetracker SDK output and detected Action Units (.bin);

• Audio file (.wave);

• Color video file (.avi).

As explained, regarding facial behaviors we are able to generate
data containing macro, micro, false, masked and subtle expressions.
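
The light/sound synchronizer used at the start of each take, together with the 30 fps video rate and the 16000 Hz audio rate listed in the outputs, suggests a simple way to estimate the offset between the two streams. The sketch below is only an illustration under stated assumptions: the function names, the per-frame brightness input, and the thresholds are hypothetical and are not part of the FACIA acquisition software.

```python
VIDEO_FPS = 30         # frame rate listed in the protocol outputs
AUDIO_RATE_HZ = 16000  # PCM sampling rate listed in the protocol outputs


def first_index_above(values, threshold):
    """Return the index of the first value above threshold, or None."""
    for index, value in enumerate(values):
        if value > threshold:
            return index
    return None


def sync_offset_seconds(frame_brightness, audio_samples,
                        light_threshold, sound_threshold):
    """Estimate the audio-minus-video offset from the synchronizer event.

    frame_brightness: one mean brightness value per video frame
                      (hypothetical pre-computed input).
    audio_samples:    signed PCM amplitudes (hypothetical input).
    The thresholds are assumptions and must be tuned per recording.
    """
    flash_frame = first_index_above(frame_brightness, light_threshold)
    beep_sample = first_index_above(
        (abs(sample) for sample in audio_samples), sound_threshold
    )
    if flash_frame is None or beep_sample is None:
        raise ValueError("synchronizer event not found in one of the streams")
    # Convert both event positions to seconds and take the difference.
    return beep_sample / AUDIO_RATE_HZ - flash_frame / VIDEO_FPS
```

A positive result means the sound event appears later in the audio stream than the light event does in the video stream, so the audio would be shifted back by that amount when re-synchronizing.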

Data organization and Nomenclature 

Similarly to the procedure adopted in the FdMiee protocol, we
predefine how the acquired data is organized. For each subject, a
folder called Volunteer0X is created, where X is the number
associated with the subject. Inside each subject folder, eight
additional folders are created: one per emotional sentence. Inside
each emotion folder there are two folders, numbered with the
corresponding sentence, where the three data types obtained are
placed. File names use the following template:

Volunteer0XEmotionSentence0YTake0Z.format

where X is the subject number, Y the sentence number, and Z the
take number.

Figure 6: Sample of results obtained for fear (A) and disgust (B)
emotion acting
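
The folder layout and file-name template described above can be sketched as a small path builder. This is a minimal illustration, not the tool used for FACIA: the function name is hypothetical, and the two-digit zero-padding and the naming of the sentence folders ("01", "02") are assumptions, since the text only gives the template.

```python
from pathlib import PurePosixPath


def take_path(subject, emotion, sentence, take, extension):
    """Build the relative path of one recorded file.

    Layout: subject folder / emotion folder / sentence folder / file.
    Two-digit zero-padding is an assumption; the file name has no
    separators, following the template as printed.
    """
    volunteer = f"Volunteer{subject:02d}"
    file_name = (
        f"{volunteer}{emotion}Sentence{sentence:02d}"
        f"Take{take:02d}.{extension}"
    )
    return PurePosixPath(volunteer) / emotion / f"{sentence:02d}" / file_name


# Example: second take of the first fear sentence for subject 3.
print(take_path(3, "Fear", 1, 2, "avi"))
```

Generating paths from one function keeps the names consistent across takes and file formats, which makes later batch processing of the database simpler.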

4.2.5. FACIA acquisition Protocol validation 

To validate Protocol II, we followed it with eighteen subjects, for
a total of 130 files per subject (504 acquisitions in total).
Regarding the subject characteristics variable, we had seven female
and eleven male subjects, aged between 20 and 35 years, all
Caucasian. As already explained, we require depth information, so a
Microsoft Kinect was used as acquisition hardware. A sample of the
results acquired during validation can be observed in Figure 6.

5. Discussion and Conclusions 

In this paper, we presented a methodology to facilitate the
development of two facial data acquisition protocols. Following this
methodology, we presented the protocols for simulating and capturing
real-life scenarios and facial behaviors. To validate the protocols,
two sample databases were created: FdMiee and FACIA. They contain
comprehensive information on facial variations inherent to both
spontaneous and non-spontaneous facial expressions under a wide
range of realistic and uncontrolled situations. The generated
databases can be used in a variety of applications, such as CV
systems evaluation, testing, and training. They also serve as a
proof of concept. Adopting our methodology and following our
protocols reduces the time required for customized database
acquisition.

Throughout the protocol creation process, we characterized two
groups of variables: flexible variables (subjects' characteristics
and capture hardware) and fixed performance variables (external and
facial parameters). The first protocol focuses on the simulation of
external parameters as variation of the fixed performance variable.
As an extension, the second protocol provides guidelines to induce
and capture real-life facial behaviors as fixed performance
variables.

Protocol I allows the acquisition of a facial database containing a
large number of variations of the fixed parameters (external and
facial): lighting, background, multi-subject, occlusions, head
rotation, universal-based, and speech facial expressions (Table 1).
Lighting variations introduce changes in facial features (e.g.
contrast and brightness). These variations enable us to test how CV
systems react to them and how detection and tracking are affected.
Static and dynamic variations in the background usually interfere
with CV systems' performance while detecting and tracking faces.
Therefore, in this protocol, we simulate different background
contexts, as well as introduce static and dynamic features in the
environment. Similar to background variables, we simulate
multi-subject environments, since this situation usually interferes
with, and at times disables, CV systems' feature detection [1].
Occlusions generated by glasses or hardware are also common in
real-life scenarios, influencing face recognition and emotion
classification accuracy [25][26][27]. The increasing use of
head-mounted displays in Virtual Reality applications makes it
crucial to test systems' invariance under these variables. Regarding
facial behaviors, we reproduced and captured two kinds:
universal-based and speech-based facial expressions. Universal-based
facial expressions are related to pure emotions [29]. They provide
data for emotion recognition systems and enable the testing of
systems' invariance while subjects' faces change expressions. Speech
facial expressions, on the other hand, are inherent to all types of
expressions (as shown in the image) and enable the measuring of
systems' accuracy and precision. To validate Protocol I, we
performed an acquisition on eight subjects with different subject
characteristics, leading to the creation of the FdMiee database.
FdMiee contains facial behaviors under different environment
contexts. Hence, this protocol enables the generation of databases
that are useful for a wide range of CV systems performance tests.

Protocol II extends the first protocol regarding facial behaviors
and performance variables by introducing induced facial features. To
achieve this, we proposed an emotion induction method in which
facial expressions were induced through emotional acting. Analysing
the FACIA database generated in the validation process, we verified
that the facial behaviors inherent to a given emotional act do
differ among individuals; i.e. subjects performed different acts to
realize identical emotional states. Analysing subjects' facial
behaviors, we were able to simulate all types of expressions
according to subjects' interpretations of, and engagement with, the
induction sentences. Hence, this protocol provides a large and
heterogeneous set of facial behaviors, useful for determining the
accuracy of tracking and recognition systems. This was intuitively
expected, since expressions inherent to emotional states share some
action units [29]. This mixing of expressions can compromise the use
of the database to train a machine learning classifier for pure
expression recognition, increasing classification error. Microsoft
Kinect was chosen as the acquisition hardware variable so that we
could record three kinds of data: color, depth (3D facial
information), and speech. Introducing depth in the stored data
provides valuable information. However, recent studies point out
that the acquisition rate of the Kinect is not sufficient for
capturing micro and subtle expressions [29]. This explains the poor
component of micro and subtle expressions present in FACIA. However,
in our methodology we classify this variable as flexible, to ensure
that the protocol guidelines can be used with other acquisition
hardware; i.e. the guidelines can be applied with high frame rate
cameras to improve the capture of these facial behaviors. The speech
recording also allows the collection of Portuguese emotional data,
opening novel research lines in emotion classification and
recognition in European Portuguese speech.

In conclusion, our proposed methodology facilitates the generation
of facial data acquisition protocols. This methodology provides a
tool for researchers to develop their own facial databases. It also
enables performance tests, validation, and training processes in CV
systems in a wide range of life-like scenarios and facial behaviors,
being adaptable to different subject characteristics and acquisition
hardware.

6. Future Work 

Our further work will focus on the following key tasks. First, we
aim to enlarge our proof-of-concept sample databases and,
subsequently, perform a statistical validation of the two protocols
presented in this paper. Enlarging the databases will provide
sufficient data for statistical validation, using various CV
systems. The statistical validation will also provide more
measurable information regarding data significance and impact.
Second, we aim to devise more parameters for the methodology
variables to refine the validation process. Third, we aim to
introduce more heterogeneous subject samples, with a wider age range
(thus greater presence of wrinkles and facial pigments), skin
colors, and make-up. Fourth, we intend to carry out tests with more
sophisticated acquisition hardware, such as high-speed cameras.
Finally, to increase the applicability of our work, we intend to
extend the fixed variable of performance parameters, providing more
guidelines to generate novel situations.